"Advanced Graphics and Data Visualization in R" is brought to you by the Centre for the Analysis of Genome Evolution & Function's (CAGEF) bioinformatics training initiative. This CSB1021 was developed to enhance the skills of students with basic backgrounds in R by focusing on available philosophies, methods, and packages for plotting scientific data. While the datasets and examples used in this course will be centred on SARS-CoV-2 epidemiological and genomic data, the lessons learned herein will be broadly applicable.
This lesson is the sixth and final lecture in a 6-part series. The aim for the end of this series is for students to recognize how to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help to identify and convey the key message(s) from their experimental data.
The structure of the class is a code-along style in Jupyter notebooks. At the start of each lecture, skeleton versions of the lecture will be provided for use on the University of Toronto Jupyter Hub so students can program along with the instructor.
This week will focus on commonly used visualizations related to genomic information. When working with sequencing data, you may commonly wish to compare sequences based on their relationships or relative similarity (trees), by their sequence identity in gene families or potential interactions (graphs), and or more directly their sequence motifs.
At the end of this lecture you will have covered the following topics
grey background - a package, function, code, command or directory. Backticks are also use for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink
Today's datasets will focus on SARS-CoV-2 variant surveillance data from the Nextstrain website which has been tracking published sequenced genomes for the appearance of new strains mainly in North America.
This is a Newick format data set describing a phylogenetic tree of SARS-CoV-2 strain information.
Metadata that accompanies the first dataset. It links strain information back to as much geographical and related information as possible.
repr- a package useful for altering some of the attributes of objects related to the R kernel.
tidyverse which has a number of packages including dplyr, tidyr, stringr, forcats and ggplot2
viridis helps to create color-blind palettes for our data visualizations
RColorBrewer has some hlepful palettes that we'll need to colour our data.
ggnewscale will be helpful in generating multiple colour palettes across increasingly complex plots.
treeio, tidytree, and ggtree will be used to help import, parse and plot phylogenetic trees.
tidygraph and ggraph will be used to generate network graph objects and plot them.
ggseqlogo is used for generating sequence logos related to sequence motifs.
qqman is a wrapper package for generating Manhattan plots.
lubridate and zoo help us to work with some date-based information.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
# BiocManager::install("ggnewscale")
# install.packages("tidytree")
# BiocManager::install("ggtree")
# install.packages("ggseqlogo")
# install.packages("qqman")
# Packages to help tidy our data
library(tidyverse)
library(readxl)
# Packages for the graphical analysis section
library(repr)
library(viridis)
library(RColorBrewer)
library(ggnewscale)
# Packages for today's lecture about trees
library(tidytree)
library(ggtree)
library(treeio)
# Packages for graphs
library(ggraph)
library(tidygraph)
# Packages for sequence logos
library(ggseqlogo)
# Packages for Manhattan plots
library(qqman)
# Date calculation helpers
library(lubridate)
library(zoo)
-- Attaching packages --------------------------------------- tidyverse 1.3.0 -- v ggplot2 3.3.2 v purrr 0.3.4 v tibble 3.0.4 v dplyr 1.0.2 v tidyr 1.1.2 v stringr 1.4.0 v readr 1.4.0 v forcats 0.5.0 -- Conflicts ------------------------------------------ tidyverse_conflicts() -- x dplyr::filter() masks stats::filter() x dplyr::lag() masks stats::lag() Loading required package: viridisLite Warning message: "package 'ggnewscale' was built under R version 4.0.5" Warning message: "package 'tidytree' was built under R version 4.0.5" Attaching package: 'tidytree' The following object is masked from 'package:stats': filter Registered S3 method overwritten by 'treeio': method from root.phylo ape ggtree v2.4.1 For help: https://yulab-smu.top/treedata-book/ If you use ggtree in published research, please cite the most appropriate paper(s): - Guangchuang Yu. Using ggtree to visualize data on tree-like structures. Current Protocols in Bioinformatics, 2020, 69:e96. doi:10.1002/cpbi.96 - Guangchuang Yu, Tommy Tsan-Yuk Lam, Huachen Zhu, Yi Guan. Two methods for mapping and visualizing associated data on phylogeny using ggtree. Molecular Biology and Evolution 2018, 35(12):3041-3043. doi:10.1093/molbev/msy194 - Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution 2017, 8(1):28-36. doi:10.1111/2041-210X.12628 Attaching package: 'ggtree' The following object is masked from 'package:tidyr': expand treeio v1.14.3 For help: https://yulab-smu.top/treedata-book/ If you use treeio in published research, please cite: LG Wang, TTY Lam, S Xu, Z Dai, L Zhou, T Feng, P Guo, CW Dunn, BR Jones, T Bradley, H Zhu, Y Guan, Y Jiang, G Yu. treeio: an R package for phylogenetic tree input and output with richly annotated and associated data. Molecular Biology and Evolution 2020, 37(2):599-603. doi: 10.1093/molbev/msz240 Warning message: "package 'ggraph' was built under R version 4.0.5" Warning message: "package 'tidygraph' was built under R version 4.0.5" Attaching package: 'tidygraph' The following object is masked from 'package:stats': filter Warning message: "package 'ggseqlogo' was built under R version 4.0.5" Warning message: "package 'qqman' was built under R version 4.0.5" For example usage please run: vignette('qqman') Citation appreciated but not required: Turner, (2018). qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. Journal of Open Source Software, 3(25), 731, https://doi.org/10.21105/joss.00731. Attaching package: 'lubridate' The following objects are masked from 'package:base': date, intersect, setdiff, union Attaching package: 'zoo' The following objects are masked from 'package:base': as.Date, as.Date.numeric
Dendrograms, cladograms, and phylogenetic trees all share a similar structure with some modifications. All are represented in branching tree structures that are used to represent the relationships among the leaves, also known as tips. Branches along the tree may also be referred to as edges.
Leaves can represent different species, strains, or multi-dimensional values. Leaves are connected by branches to their nearest neighbours or relatives. As you move backwards along the tree, you encounter internal nodes which can connect directly to more tips or more nodes. The distance between tips and nodes represent a relative distance that may be defined as some type of evolutionary distance, time, or simple euclidean distance. In a cladogram, however, distances along branches have no meaning and only define the presence of a relationship.
In tree terminology, it is helpful to define a few more terms:
| Term | Description |
|---|---|
| Root | A tree is rooted when it defines a common ancester for all tips in the tree. This could be considered the start or entry point for the tree |
| Most recent common ancester (MRCA) | This is an internal node that collectively joins two or more tips. |
| Clade | A group of taxa (tips) that includes a common ancestor and all of its descendents. |
| Scaling | This indicates that the branch lengths do represent a distance metric, whereas, an unscaled tree may have even-length branches not representative of evolutionary/relationship distances. |
Diagram of phylogenetic tree terms from 10 differences between cladograms and phylogenetics trees
As we saw last week, clustering analysis such as that from hclust() can create dendrogram data. We plotted this in a couple of ways by itself and along with a heatmap. In the case of our first foray, we were looking at the "relationships" between samples by indicating how similar they were based on the characteristics in their various features.
Depending on the software used, trees can be represented in a number of formats, some of which are described below.
| Format | Description |
|---|---|
| Newick | The standard used to represent trees in a computer-readable format. Trees are encoded in a parentheses-closure format where each tip takes the form of "taxa:branch-length" and pairs of tips are separated by a comma , and enclosed by parentheses (). Internal nodes branch lengths are defined outside the parentheses with "():branch-length" |
| NEXUS/phylip | Incorporates the Newick tree with separate related data that is partitioned into different blocks. Blocks are started with "BEGIN \<BLOCK NAME>;" and closed with "END;" |
| NHX | New Hampshire eXtended format is also based on Newick format but instead of code blocks uses a tagging system for each node extending the format to "taxa:branch-length[&&NHX:<tag information>]" |
read.tree() from the treeio package¶When opening tree files, depending on the format, you can use the treeio package and one of it's many parsing functions but the simplest way to avoid figuring out how to open a tree file, is to just use the read.tree() function. This will determing the appropriate file type and parse through the many different tree formats before depositing the data into a phylo object.
Today we'll be working with a dataset from the Auspice COVID-19 North American dataset maintained by Finlay Maguire
Let's begin by opening up the tree data and some related metadata before taking a look under the hood.
# Import our Newick Tree
SC2_variants_time.phylo <- read.tree("./data/nextstrain_ncov_north-america_canada_timetree.nwk")
# Import our metadata
SC2_metadata.df <- read_tsv("./data/nextstrain_ncov_north-america_canada_metadata.tsv")
-- Column specification -------------------------------------------------------- cols( .default = col_character(), Age = col_double(), url = col_logical(), `Collection Data` = col_date(format = "") ) i Use `spec()` for the full column specifications.
# What does the tree look like?
str(SC2_variants_time.phylo)
List of 6 $ edge : int [1:12776, 1:2] 6628 6628 6629 6630 6631 6630 6632 6632 6633 6629 ... $ edge.length: num [1:12776] 0.0757 0.0848 0.0336 0.0501 0 ... $ Nnode : int 6150 $ node.label : chr [1:6150] "NODE_0008606" "NODE_0000001" "NODE_0000061" "USA/MA1/2020_travel_history" ... $ tip.label : chr [1:6627] "Wuhan/WH01/2019" "USA/MA1/2020" "Canada/ON_ON-VIDO-01-2/2020" "Canada/ON_VIDO-01/2020" ... $ root.edge : num 2020 - attr(*, "class")= chr "phylo" - attr(*, "order")= chr "cladewise"
phylo object is a simple list¶From the looks of our structure, after reading in our Newick file, we have a list with six elements. They are pretty clearly named with edge information (a matrix of paired numbers), edge lengths, a count of the number of nodes (ie internal points where branches bifurcate. While mostly just numbers, some of these node.label values appear to be based on travel history information. All of the 6,627 tip.label values correspond to 6,627 SARS-CoV-2 strain names found in the metadata file as well.
Suppose we want to combine some information from our metadata with our tree? We can then use this information to colour, shape, and bring more visual order to our tree. Working with the tree, however, can be a little hard given that we have lists of different lengths representing different portions of the tree.
In our case, we only have tree tip information that we want to add to and the easiest way to work would be if it was in a table format. The package tidytree provides a function as_tibble that will parse and convert the phylo format to a tbl_tree object, which is a kind of tidy data frame.
Let's convert our variant phylo object to something we can work with.
# Convert with as_tibble
SC2_variants_time.tb <- as_tibble(SC2_variants_time.phylo)
# Check out the structure now
str(SC2_variants_time.tb)
head(SC2_variants_time.tb)
tibble [12,777 x 4] (S3: tbl_tree/tbl_df/tbl/data.frame) $ parent : int [1:12777] 6628 6631 6632 6633 6636 6635 6638 6638 6639 6640 ... $ node : int [1:12777] 1 2 3 4 5 6 7 8 9 10 ... $ branch.length: num [1:12777] 0.0757 0 0.0077 0 0 ... $ label : chr [1:12777] "Wuhan/WH01/2019" "USA/MA1/2020" "Canada/ON_ON-VIDO-01-2/2020" "Canada/ON_VIDO-01/2020" ...
| parent | node | branch.length | label |
|---|---|---|---|
| <int> | <int> | <dbl> | <chr> |
| 6628 | 1 | 0.075652198 | Wuhan/WH01/2019 |
| 6631 | 2 | 0.000000000 | USA/MA1/2020 |
| 6632 | 3 | 0.007696891 | Canada/ON_ON-VIDO-01-2/2020 |
| 6633 | 4 | 0.000000000 | Canada/ON_VIDO-01/2020 |
| 6636 | 5 | 0.000000000 | USA/CA-CDC-5/2020 |
| 6635 | 6 | 0.044204648 | Taiwan/CGMH-CGU-02/2020 |
As you can see our SC2_variants_time.tb object is a simply data frame now a 4-column table with 12,777 rows. The edge data is encoded between the parent and node columns with branch.length to define the length of each edge. We also have all of the tip and node names stored under the label column. With the tree in this format, we can now treat the tree like a tibble and join additional information to the tree.
Rules to keep in mind:
as.treedataYou may have informaiton about the leaves or tips of your tree but by default there is no real internal node information when it comes to sample origins or lineage. When joining information to the tables, use the correct direction *_join() to avoid data loss. A full_join() is probably the safest.
Let's take a quick look at our metadata and identify what we're interested in.
# Look at the metadata, which bits of information would be nice to add?
str(SC2_metadata.df, give.attr = FALSE)
tibble [6,627 x 24] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ Strain : chr [1:6627] "Wuhan/WH01/2019" "USA/MA1/2020" "Canada/ON_ON-VIDO-01-2/2020" "Canada/ON_VIDO-01/2020" ... $ GISAID clade : chr [1:6627] "L" "L" "L" "L" ... $ Nextstrain clade : chr [1:6627] "19A" "19A" "19A" "19A" ... $ Age : num [1:6627] 44 21 NA NA 54 NA 47 47 23 55 ... $ Clade : chr [1:6627] "19A" "19A" "19A" "19A" ... $ Country : chr [1:6627] "Asia" "USA" "Canada" "Canada" ... $ Admin Division : chr [1:6627] "Asia" "Massachusetts" "Ontario" "Ontario" ... $ Admin Division of exposure: chr [1:6627] "Asia" "Asia" "Ontario" "Asia" ... $ gisaid_epi_isl : chr [1:6627] "EPI_ISL_406798" "EPI_ISL_409067" "EPI_ISL_425177" "EPI_ISL_413015" ... $ Host : chr [1:6627] "Human" "Human" "Human" "Human" ... $ Originating Lab : chr [1:6627] "General Hospital of Central Theater Command of People's Liberation Army of China" "Massachusetts Department of Public Health" "Public Health Ontario" "Public Health Ontario Laboratory" ... $ PANGO lineage : chr [1:6627] "B" "B" "B" "B" ... $ Submission Date : chr [1:6627] "Older" "Older" "Older" "Older" ... $ Region : chr [1:6627] "Asia" "North America" "North America" "North America" ... $ Region of exposure : chr [1:6627] "Asia" "Asia" "North America" "Asia" ... $ Sex : chr [1:6627] "Male" "Male" "Male" "Male" ... $ strain : chr [1:6627] "Wuhan/WH01/2019" "USA/MA1/2020" "Canada/ON_ON-VIDO-01-2/2020" "Canada/ON_VIDO-01/2020" ... $ Emerging Nextstrain clade : chr [1:6627] "19A" "19A" "19A" "19A" ... $ Submitting Lab : chr [1:6627] "BGI & Institute of Microbiology, Chinese Academy of Sciences & Shandong First Medical University & Shandong Aca"| __truncated__ "Pathogen Discovery, Respiratory Viruses Branch, Division of Viral Diseases, Centers for Disease Control and Prevention" "Public Health Agency of Canada - National Microbiology Laboratory" "National Microbiology Laboratory" ... $ url : logi [1:6627] NA NA NA NA NA NA ... $ Collection Data : Date[1:6627], format: "2019-12-26" "2020-01-29" ... $ Author : chr [1:6627] "Weijun Chen et al (https://dx.doi.org/10.1016/S0140-6736(20)30251-8)" "Clinton R. Paden et al (https://dx.doi.org/10.1101/2020.03.09.20032896)" "Amrit S. Boese et al" "Shari Tyson et al" ... $ Country of exposure : chr [1:6627] NA "Asia" NA "Asia" ... $ Location : chr [1:6627] NA "Boston" NA NA ...
# Check out some of the values for these variables
unique(SC2_metadata.df$'GISAID clade')
unique(SC2_metadata.df$'Nextstrain clade')
unique(SC2_metadata.df$Clade)
unique(SC2_metadata.df$'PANGO lineage')
For our purposes it looks like we will want to explore our data with:
Let's pull that information down and add it to our tbl_tree before converting it to a treedata object.
# Add phylogenetic information to our tree
SC2_variants_time.tree <-
# Pass along the metadata
SC2_metadata.df %>%
# Select just a handful of attributes to go to the tree
select(Strain, 'GISAID clade', 'Emerging Nextstrain clade', 'PANGO lineage', 'Admin Division', 'Collection Data') %>%
rename(strain = Strain,
GISAID = 'GISAID clade',
emerging_nextstrain = 'Emerging Nextstrain clade',
PANGO = 'PANGO lineage',
strain_location = 'Admin Division',
strain_date = 'Collection Data') %>%
# full_join to our tree to ensure no data is lost although you could also carefully use a left_join
full_join(x=SC2_variants_time.tb, y=., by=c("label" = "strain")) %>%
# Convert to a tidytree format
as.treedata()
# Look at the resulting structure
str(SC2_variants_time.tree)
Warning message:
"`mutate_()` is deprecated as of dplyr 0.7.0.
Please use `mutate()` instead.
See vignette('programming') for more help
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated."
Formal class 'treedata' [package "tidytree"] with 11 slots ..@ file : chr(0) ..@ treetext : chr(0) ..@ phylo :List of 5 .. ..$ edge : int [1:12776, 1:2] 6628 6631 6632 6633 6636 6635 6638 6638 6639 6640 ... .. .. ..- attr(*, "dimnames")=List of 2 .. .. .. ..$ : NULL .. .. .. ..$ : chr [1:2] "parent" "node" .. ..$ edge.length: num [1:12776] 0.0757 0 0.0077 0 0 ... .. ..$ tip.label : chr [1:6627] "Wuhan/WH01/2019" "USA/MA1/2020" "Canada/ON_ON-VIDO-01-2/2020" "Canada/ON_VIDO-01/2020" ... .. ..$ node.label : chr [1:6150] "NODE_0008606" "NODE_0000001" "NODE_0000061" "USA/MA1/2020_travel_history" ... .. ..$ Nnode : int 6150 .. ..- attr(*, "class")= chr "phylo" ..@ data : tibble [12,777 x 6] (S3: tbl_df/tbl/data.frame) .. ..$ node : int [1:12777] 1 2 3 4 5 6 7 8 9 10 ... .. ..$ GISAID : chr [1:12777] "L" "L" "L" "L" ... .. ..$ emerging_nextstrain: chr [1:12777] "19A" "19A" "19A" "19A" ... .. ..$ PANGO : chr [1:12777] "B" "B" "B" "B" ... .. ..$ strain_location : chr [1:12777] "Asia" "Massachusetts" "Ontario" "Ontario" ... .. ..$ strain_date : Date[1:12777], format: "2019-12-26" "2020-01-29" ... ..@ extraInfo : tibble [0 x 0] (S3: tbl_df/tbl/data.frame) Named list() ..@ tip_seq : 'DNAbin' raw(0) ..@ anc_seq : 'DNAbin' raw(0) ..@ seq_type : chr(0) ..@ tipseq_file: chr(0) ..@ ancseq_file: chr(0) ..@ info : list()
Take a quick look at the structure of this treedata object. What do we see? The treedata class has converted our tibble back into an S4 object with 11 slots. Each slot, is essentially a placeholder for another object. We can access these slots using the @ operator and we can further access sub elements with the $ operator.
For instance, we can see there is a phylo slot which looks suspiciously like our original SC2_variants_time.phylo object. The rest of our data frame information, which we just added to the tbl_tree before converting it is stored in the data slot. There are additional slots where we can add sequencing information for comparison in different types of phylogenetic tree visualizations.
# Grab our phylo object
SC2_variants_time.tree@phylo
# Take a peek at our associated node data
head(SC2_variants_time.tree@data)
Phylogenetic tree with 6627 tips and 6150 internal nodes. Tip labels: Wuhan/WH01/2019, USA/MA1/2020, Canada/ON_ON-VIDO-01-2/2020, Canada/ON_VIDO-01/2020, USA/CA-CDC-5/2020, Taiwan/CGMH-CGU-02/2020, ... Node labels: NODE_0008606, NODE_0000001, NODE_0000061, USA/MA1/2020_travel_history, NODE_0000062, Canada/ON_VIDO-01/2020_travel_history, ... Unrooted; includes branch lengths.
| node | GISAID | emerging_nextstrain | PANGO | strain_location | strain_date |
|---|---|---|---|---|---|
| <int> | <chr> | <chr> | <chr> | <chr> | <date> |
| 1 | L | 19A | B | Asia | 2019-12-26 |
| 2 | L | 19A | B | Massachusetts | 2020-01-29 |
| 3 | L | 19A | B | Ontario | 2020-01-23 |
| 4 | L | 19A | B | Ontario | 2020-01-23 |
| 5 | L | 19A | B | California | 2020-01-29 |
| 6 | S | 19A | B | Asia | 2020-02-04 |
ggtree() from the ggtree package¶Now that we've put some extra data into our tree, we are ready to plot it with the help of the ggtree() function from the ggtree package. If you haven't noticed yet, treeio, tidytree and ggtree form a suite of packages that we can use to import, alter, and visualize our trees.
Parameters for ggtree() include:
tr: the phylo tree objectlayout: the shape of the tree including: rectangular, dendrogram, slanted, ellipse, fan, circular, inward_circular, and radial.msrd: most recent sampling date used for setting a time-based scaleas.Date: logical to specify if date will be in calendar vs decimal-date formatbranch.length: variable for scaling branch lengthroot.position: the position of our root node (default = 0)We'll also use theme_tree2() which will automatically add an x-axis scale to our tree.
We'll start simple by just visualizing all 6,627 tips of our tree.
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=20)
# Show the entire tree
SC2_variants_time.tree %>%
# 1. Data
ggtree(tr = .) +
# 2. Aesthetics
aes(color = emerging_nextstrain) +
# Themes
theme_tree2() +
theme(text = element_text(size = 30),
legend.position = "bottom") + # Move our legend to the bottom
# Provide some labels
labs(x="Time",
title = "Canada and US sequenced strain phylogeny") +
# Make our guide lines a little thicker
guides(colour = guide_legend(title = "Nextstrain\nclade",
override.aes = list(size = 2)))
drop.tip()¶Looking at our visualization, there are a lot of tips to display and we've coloured the tree along the branches. Another thing we can see is that there is an "NA" value. This mostly represents the branches of the tree connecting nodes internally since they do not have a specific lineage associated from our data.
As it stands, there's just way too much data here to look at and work with. We can trim that data using the drop.tip() function without worrying about how the tree is parsed. Internally, all that pruning will take place within the function. To do this, we'll generate a list of tips we want to remove first.
# You can make a list of tips to drop when you're making your ggtree
# Generate a table of tree information to drop
to_drop <-
SC2_metadata.df %>%
# Filter for strains that are not from Canada and are from before 2020-11-01
filter(!(Country == "Canada" &
`Collection Data` >= as.Date("2020-11-01")
)) %>%
# Just keep the strain information
select(Strain) %>%
unlist()
# Take a look at what we got back
str(to_drop)
Named chr [1:6199] "Wuhan/WH01/2019" "USA/MA1/2020" ... - attr(*, "names")= chr [1:6199] "Strain1" "Strain2" "Strain3" "Strain4" ...
geom_* layers¶The ggtree package brings a number of ggplot2-compatible geoms to our finger tips. We'll spruce up our tree with a some tip labels to begin with.
| Component type | Layer type | Command | Description |
|---|---|---|---|
| Tree | Geom | geom_tree | tree structure layer, with multiple layout supported |
| Tree | Geom | geom_treescale | tree branch scale legend |
| Node | Geom | geom_nodepoint | annotate internal nodes with symbolic points |
| Node | Annotation | geom_range | bar layer to present uncertainty of evolutionary inference |
| Node | Annotation | geom_rootpoint | annotate root node with symbolic point |
| Branch | Annotation | geom_label2 | modified version of geom_label, with subsetting supported for labelling branches |
| Branch | Annotation | geom_segment2 | modified version of geom_segment, with subsetting supported |
| Taxa | Geom | geom_point2 | modified version of geom_point, with subsetting supported |
| Taxa | Geom | geom_tippoint | annotate external nodes with symbolic points |
| Taxa | Annotation | geom_taxalink | associate two related taxa by linking them with a curve |
| Taxa | Annotation | geom_text2 | modified version of geom_text, with subsetting supported |
| Taxa | Annotation | geom_tiplab | layer of tip labels |
| Taxa | Annotation | geom_tiplab2 | layer of tip labels for circular layout |
| Clade | Annotation | geom_balance | highlights the two direct descendant clades of an internal node |
| Clade | Annotation | geom_cladelabel | annotate a clade with bar and text label |
| Clade | Annotation | geom_hilight | highlight a clade with rectangle |
| Clade | Annotation | geom_strip | annotate associated taxa with bar and (optional) text label |
# Show our pruned tree
SC2_variants_time.tree %>%
# drop tips based on our list
drop.tip(., to_drop) %>%
# 1. Data
ggtree(tr = .) + # Need the most recent sampling date
# 2. Aesthetics
aes(color = emerging_nextstrain) +
# Themes
theme_tree2() +
theme(text = element_text(size = 30),
legend.position = "bottom") + # Move our legend to the bottom
# Provide some labels
labs(x="Time",
title = "Subset of Canadian sequenced strains") +
# Make our guide lines a little thicker
guides(colour = guide_legend(title = "Nextstrain\nclade",
override.aes = list(size=2))) +
# 4. Geoms
geom_tiplab()
In a tree this packed, you can't really read the tip labels. The data that's most helpful, however, is the colouring. You can replace your labels with points in 3 ways:
geom_point()geom_nodepoint()geom_tippoint()If we use geom_point() we can colour/annotate all of the nodes and tips but we already know that there isn't any phylogenetic information associated with our internal nodes. Likewise, geom_nodepoint() will allow us to colour the nodes but there isn't any grouping information associated with that.
Therefore we'll use geom_tippoint() to colour our tips based on their Nextstrain clade information and we can alter the tip shape to match the province where the strain was sequenced.
# Show entire tree
SC2_variants_time.tree %>%
drop.tip(., to_drop) %>%
# 1. Data
ggtree(tr = .) + # Need the most recent sampling date
# 2. Aesthetics
aes(color = emerging_nextstrain) +
# Themes
theme_tree2() +
theme(text = element_text(size = 30),
legend.position = "bottom", # Move our legend to the bottom
legend.box = "vertical") + # Stack legends vertically
# Provide some labels
labs(x="Time",
title = "Subset of Canadian sequenced strains") +
# Make our guide lines a little thicker
guides(colour = guide_legend(title = "Nextstrain\nclade",
override.aes = list(size=2)),
shape = guide_legend(title = "Strain\nlocation")
) +
# 3. Scaling
scale_shape_manual(values = c(15:20, 11)) +
# 4. Geoms
geom_tippoint(aes(shape = strain_location), size = 3) # Add tips only
mrsd parameter¶As you can see we are working with a time-based axis but it all appears to be on a relative scale and what we'd really like to see is real-world time. In order to do that, we can assign the mrsd parameter to the most recent sampling date in our dataset. We'll also set our parameter as.Date in order to see the date in a calendar format rather than a decimal-date format.
# Show entire tree
SC2_variants_time.tree %>%
drop.tip(., to_drop) %>%
# 1. Data
ggtree(tr = ., mrsd = "2021-02-17", as.Date = TRUE) + # Need the most recent sampling date
# 2. Aesthetics
aes(color = emerging_nextstrain) +
# Themes
theme_tree2() +
theme(text = element_text(size = 30),
legend.position = "bottom", # Move our legend to the bottom
legend.box = "vertical") + # Stack legends vertically
# Provide some labels
labs(x="Time",
title = "Subset of Canadian sequenced strains") +
# Make our guide lines a little thicker
guides(colour = guide_legend(title = "Nextstrain\nclade",
override.aes = list(size=2)),
shape = guide_legend(title = "Strain\nlocation")
) +
# 3. Scaling
scale_shape_manual(values = c(15:20, 11)) +
# 4. Geoms
geom_tippoint(aes(shape = strain_location), size = 3) # Add tips only
viewClade() and other information functions¶Looking at the tree there are still quite a few nodes but we can zoom in on a specific clade using the viewClade() function. To accomplish this we need to access the MRCA of a group. At the same time, we should talk about ways to traverse or navigate this tree. Here's where tidytree offers up a few additional functions for searching your tbl_tree. These take on the form of function(tbl_tree, node_number) or function(tbl_tree, node_label)
| function | Description |
|---|---|
| child() | Find the children of an internal node. |
| parent() | Find the parent node of a tip or other node. |
| offspring() | Find all of the offspring of a node. |
| ancestor() | Find all of the ancestors of a tip or node |
| sibling() | Find the nearest sibling of a tip (or node) |
| MRCA() | Find the most recent common ancestor between two or more nodes |
Let's filter our taxa list a little more and then find the MRCA between those tips.
# You can make a list of tips to drop when you're making your ggtree
# Generate a table of tree information to drop
to_view <-
SC2_metadata.df %>%
# Filter our tip information
filter(`Emerging Nextstrain clade` %in% c("20I/501Y.V1", "20H/501Y.V2") &
Country == "Canada" &
`Collection Data` >= as.Date("2021-02-16")
) %>%
# Just keep the strain names
select(Strain) %>%
unlist()
# What's the list look like?
to_view
# Calculate the MRCA from a subset of our tips
to_view_MRCA <-
# Pass along our tree information
SC2_variants_time.tree %>%
# Get rid of the tips we would before (to match the tree we'll plot)
drop.tip(., to_drop) %>%
# Determine the MRCA between the first 4 tips in the list
MRCA(., to_view[1:4])
to_view_MRCA
# How many offspring does it have?
# Pass along our tree information
SC2_variants_time.tree %>%
drop.tip(., to_drop) %>%
as_tibble() %>% # make a tibble so we can see all the relevant information
offspring(., to_view_MRCA) %>%
as.treedata() %>%
str()
Formal class 'treedata' [package "tidytree"] with 11 slots ..@ file : chr(0) ..@ treetext : chr(0) ..@ phylo :List of 5 .. ..$ edge : int [1:71, 1:2] 728 730 731 732 732 734 735 735 736 737 ... .. .. ..- attr(*, "dimnames")=List of 2 .. .. .. ..$ : NULL .. .. .. ..$ : chr [1:2] "parent" "node" .. ..$ edge.length: num [1:71] 0.1136 0.067 0.0521 0.0306 0.0333 ... .. ..$ tip.label : chr [1:39] "Canada/NL-NML-16819/2021" "Canada/NL-NML-16824/2021" "Canada/NL-NML-17178/2021" "Canada/NL-NML-16829/2021" ... .. ..$ node.label : chr [1:32] "NODE_0006538" "NODE_0006539" "NODE_0006540" "NODE_0006541" ... .. ..$ Nnode : int 32 .. ..- attr(*, "class")= chr "phylo" ..@ data : tibble [71 x 6] (S3: tbl_df/tbl/data.frame) .. ..$ node : int [1:71] 323 324 325 326 327 328 329 330 331 332 ... .. ..$ GISAID : chr [1:71] "GRY" "GRY" "GRY" "GRY" ... .. ..$ emerging_nextstrain: chr [1:71] "20I/501Y.V1" "20I/501Y.V1" "20I/501Y.V1" "20I/501Y.V1" ... .. ..$ PANGO : chr [1:71] "B.1.1.7" "B.1.1.7" "B.1.1.7" "B.1.1.7" ... .. ..$ strain_location : chr [1:71] "Newfoundland and Labrador" "Newfoundland and Labrador" "Newfoundland and Labrador" "Newfoundland and Labrador" ... .. ..$ strain_date : Date[1:71], format: "2021-02-10" "2021-02-12" ... ..@ extraInfo : tibble [0 x 0] (S3: tbl_df/tbl/data.frame) Named list() ..@ tip_seq : 'DNAbin' raw(0) ..@ anc_seq : 'DNAbin' raw(0) ..@ seq_type : chr(0) ..@ tipseq_file: chr(0) ..@ ancseq_file: chr(0) ..@ info : list()
ggplot2 object to viewClade()¶Now that we have an MRCA node (727) that we can use to determine the specific clade we can see the tree will still be quite large. To view this particular clade, we build our plot as before, and then zoom into it afterwards with the viewClade() function.
Unfortunately, we'll have to ditch our date scale and revert to a decimal-date. If you're not using a calendar-based date, it's not a problem at all. If we look carefully at the data, we'll see that the internal nodes have an NA-value. If you were industrious enough, you could use the branch lengths to calculate traverse the tree and calculate dates for all the internal nodes as well. This would likely solve the issue.
Let's drop the tip names back in there too.
# Show a small clade of the tree
tree.plot <-
SC2_variants_time.tree %>%
drop.tip(., to_drop) %>%
# 1. Data
ggtree(tr = ., mrsd = "2021-02-16") + # Need the most recent sampling date
# 2. Aesthetics
aes(color = emerging_nextstrain) +
# Themes
theme_tree2() +
theme(text = element_text(size = 30),
legend.position = "bottom", # Move our legend to the bottom
legend.box = "vertical") + # Stack legends vertically
# Provide some labels
labs(x="Time",
title = "Subset of Canadian sequenced strains") +
# Make our guide lines a little thicker
guides(colour = guide_legend(title = "Nextstrain\nclade",
override.aes = list(size=2)),
shape = guide_legend(title = "Strain\nlocation")
) +
# 3. Scaling
scale_shape_manual(values = c(15:20, 11)) +
# 4. Geoms
geom_tiplab(size = 7, align=TRUE) + # Add labels in and right-align them
geom_tippoint(aes(shape = strain_location), size = 3) # Add tips only
# View the clade we made using the same MRCA call we already used.
viewClade(tree.plot, MRCA(tree.plot, to_view[1:4]))
Warning message: "Duplicated aesthetics after name standardisation: size"
So we've looked at a few ways to subset and plot our phylogenetic tree. We'll do a quick aside to create a more curated list of tips with representation from across multiple clades. We'll use this as the base tree for our next few graphs so that we can really see the power of using additional information from our treedata object.
# Here's our representative list of tips from multiple clades
curated_tips <-
c('Wuhan/WH01/2019',
'Canada/ON_ON-VIDO-01-2/2020',
'Canada/ON_VIDO-01/2020',
'Canada/BC_37_0-2/2020',
'Canada/BC_69243/2020',
'Canada/QC-CHUM-2008003956A/2020',
'Canada/NL-NML-1387/2020',
'Canada/MB-NML-808/2020',
'Canada/NB-NML-16967/2021',
'Canada/BC_6129127/2020',
'Canada/AB-97776/2020',
'Canada/AB-12948/2020',
'Canada/BC-BCCDC-3564/2020',
'Canada/ON-S67/2020',
'Canada/AB-65233/2020',
'Canada/MB-NML-1057/2020',
'Canada/ON-UHTC-0366/2020',
'Canada/NS-NML-16218/2021',
'Canada/NB-NML-16966/2021',
'Canada/NS_13/2020',
'Canada/Qc-CHUM-2019203856A/2020',
'Canada/NB-NML-3272/2020',
'Canada/ON-UHTC-0267/2020',
'Canada/MB-NML-1037/2020',
'Canada/NS-NML-5213/2020',
'Canada/NS-NML-5141/2021',
'Canada/NS-NML-5140/2021',
'Canada/NS-NML-16208/2021',
'Canada/BC-BCCDC-7012/2020',
'Canada/ON-LTRI-1372/2020',
'Canada/BC-BCCDC-9736/2021',
'Canada/ON-S2125/2021',
'Canada/NS-NML-5181/2020',
'Canada/ON-PPS-21012021_0269/2021',
'Canada/QC-L00314539/2020',
'Canada/NS-NML-16238/2021',
'Canada/NL-NML-16822/2021',
'Canada/QC-L00324589001/2021',
'Canada/NL-NML-17194/2021')
# Make a new list of tips to drop
curated_drop_list <-
SC2_metadata.df %>%
# filter the opposite of our tip labels
filter(!(Strain %in% curated_tips)) %>%
select(Strain) %>%
unlist()
curated_tree_info.df <-
# Pass along our original tree
SC2_variants_time.tree %>%
# Drop the tips we created
drop.tip(., curated_drop_list) %>%
# Convert to a tibble
as_tibble() %>%
# select columns we want
select(4:8) %>%
# Rename our column information
rename(Clade = emerging_nextstrain, Province = strain_location) %>%
# Convert this to a data frame
as.data.frame()
# Set the rownames and drop the label information
rownames(curated_tree_info.df) <- curated_tree_info.df$label
curated_tree_info.df <- curated_tree_info.df[,2:5]
# Let's view the tree
str(curated_tree_info.df)
'data.frame': 72 obs. of 4 variables: $ GISAID : chr "L" "L" "L" "S" ... $ Clade : chr "19A" "19A" "19A" "19B" ... $ PANGO : chr "B" "B" "B" "A.1" ... $ Province: chr "Asia" "Ontario" "Ontario" "Manitoba" ...
layout parameter¶So we've only been working with a rectangular tree (horizontal layout) thus far but as we mentioned at the beginning there are actually a number of different layouts we can use. We're going to make a sort of sunburst plot now using a circular layout that we'll add visual categorical information to in a ring pattern.
First, we'll start with the circular version. We'll colour our tips by the province where the case or data was generated and label them by the Nextstrain clade information.
# Show our curated tree in a circular format
tree.plot <-
SC2_variants_time.tree %>%
drop.tip(., curated_drop_list) %>%
# 1. Data
ggtree(., layout = "circular", mrsd = "2021-02-17") + # Need the most recent sampling date
# 2. Aesthetics
# Themes
theme(text = element_text(size = 30),
legend.position = "bottom") +
labs(title = "Canada and US sequenced strain phylogeny",
colour = "Province") +
scale_colour_viridis_d(option = "plasma") +
# 4. Geoms
geom_tippoint(aes(colour = strain_location), size = 5) + # Add tips only
geom_tiplab(aes(label = emerging_nextstrain), size = 8)
# View our tree
tree.plot
gheatmap() to add information to your plot¶Now that we have our basic circular tree, we can use the information from curated_tree_info.df to visualize additional data on our tree. In order to use the gheatmap layer properly, we must pass it our original tree plot, along with a new dataframe (or vector) that holds the information we want to add. It should use the rownames from the data frame to help match up to the tips in our tree.
We'll drop the tip labels from our tree in favour of the heatmap we'll add.
# Show our curated tree in a circular format
tree.plot <-
SC2_variants_time.tree %>%
drop.tip(., curated_drop_list) %>%
# 1. Data
ggtree(., layout = "circular", mrsd = "2021-02-17") + # Need the most recent sampling date
# 2. Aesthetics
# Themes
theme(text = element_text(size = 30),
legend.position = "bottom") +
labs(title = "Canada and US sequenced strain phylogeny",
colour = "Province") +
scale_colour_viridis_d(option = "plasma") +
# 4. Geoms
geom_tippoint(aes(colour = strain_location), size = 5) # Add tips only
# Pass along the tree to gheatmap() and give it just the Nextstrain information in column 2
# Remember we've renamed the columns in curated_tree_info.df
tree.plot <- gheatmap(tree.plot, curated_tree_info.df %>% select(2),
offset = 0.1, width = 0.1,
colnames_angle = 90, font.size = 8) +
scale_fill_discrete(name = "Nextstrain\nClade")
tree.plot
Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.
ggnewscale package allows multiple color/fill scales¶From our sunburst plot above we could have actually used multiple columns from curated_tree_info.df but the caveat is that the legend for each resulting column will combine all of the data into a single new legend. That's helpful if the type of data could be continuous values like a [0,1] scale heatmap colour across different features. When dealing with multiple categories, however, that doesn't work well for us.
Instead, we'd like to separate the colour/fill scales by repeatedly adding to our plot, layer by layer. As you might recall, however, we also run into the problem of scale_colour_* and scale_fill_* layers overwriting any previous layers.
To circumvent this reality, we'll use the ggnewscale package to generate new colour and fill scales with new_scale_fill() and new_scale_colour(). It can make the process slightly more encumbering but it will work out.
We'll add back in our original tip labels as well for this graph.
# Adjust our plot window size according to the expected output
options(repr.plot.width=50, repr.plot.height=50)
# Show our curated tree in a circular format
# We'll need some extra colour sets to accomplish this plot
combo.colours = c(brewer.pal(12, "Paired"), brewer.pal(8, "Set1"), brewer.pal(12, "Set3"))
# Save our first plot with just the tree
tree.plot <-
SC2_variants_time.tree %>%
drop.tip(., curated_drop_list) %>%
# 1. Data
ggtree(., layout = "circular", mrsd = "2021-02-17") + # Need the most recent sampling date
# 2. Aesthetics
# Themes
theme(text = element_text(size = 30),
legend.position = "bottom") +
labs(title = "Canada and US sequenced strain phylogeny",
colour = "Province") +
# We'll want to ensure our colour guide ends up in the correct order
scale_colour_viridis_d(option = "plasma",
guide = guide_legend(title="Province", order=1)) +
# 4. Geoms
geom_tippoint(aes(colour = strain_location), size = 5) + # Add tips only
geom_tiplab(size = 10, align=TRUE, offset = 0.35) # Add in our labels
# Generate our first heatmap layer
tree.plot <- gheatmap(tree.plot, curated_tree_info.df %>% select(2),
offset = 0.02, width = 0.1,
colnames_angle = 90, font.size = 12) +
# Set the name of our legend
scale_fill_discrete(name = "Nextstrain\nClade",
guide = guide_legend(order=2))
tree.plot <- tree.plot + new_scale_fill() # use this
tree.plot <-
gheatmap(tree.plot, curated_tree_info.df %>% select(3),
offset = 0.17, width = 0.1,
colnames_angle = 90, font.size = 12) +
scale_fill_manual(values = combo.colours, name = "PANGO\nlineage",
guide=guide_legend(order=3))
tree.plot
Warning message: "Duplicated aesthetics after name standardisation: size" Warning message: "Duplicated aesthetics after name standardisation: size" Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale. Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.
# Adjust our plot window size according to the expected output
options(repr.plot.width=30, repr.plot.height=30)
# Show our curated tree in a rectangular format
# We'll need some extra colour sets to accomplish this plot
combo.colours = c(brewer.pal(12, "Paired"), brewer.pal(8, "Set1"), brewer.pal(12, "Set3"))
# Save our first plot with just the tree
tree.plot <-
SC2_variants_time.tree %>%
drop.tip(., curated_drop_list) %>%
# 1. Data
ggtree(., layout = "rectangular", mrsd = "2021-02-17") + # Need the most recent sampling date
# 2. Aesthetics
# Themes
theme(text = element_text(size = 30),
legend.position = "bottom") +
labs(title = "Canada and US sequenced strain phylogeny",
colour = "Province") +
# We'll want to ensure our colour guide ends up in the correct order
scale_colour_viridis_d(option = "plasma",
guide = guide_legend(title="Province", order=1)) +
# 4. Geoms
geom_tippoint(aes(colour = strain_location), size = 5) + # Add tips only
geom_tiplab(size = 10, align=TRUE) # Add in our labels
# Generate our first heatmap layer
tree.plot <- gheatmap(tree.plot, curated_tree_info.df %>% select(2),
offset = 0.7, width = 0.1,
colnames_angle = 0, font.size = 10) +
# Set the name of our legend
scale_fill_discrete(name = "Nextstrain\nClade",
guide = guide_legend(order=2))
tree.plot <- tree.plot + new_scale_fill() # use this
tree.plot <-
gheatmap(tree.plot, curated_tree_info.df %>% select(3),
offset = 0.85, width = 0.1,
colnames_angle = 0, font.size = 10) +
scale_fill_manual(values = combo.colours, name = "PANGO\nlineage",
guide=guide_legend(order=3))
tree.plot
Warning message: "Duplicated aesthetics after name standardisation: size" Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale. Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.
geom_taxalink()¶Suppose we want to join to tips/taxa together to suggest that there is a relationship between them or to emphasize some association between nodes. To connect nodes or tips of the tree, we can use the geom_taxalink() layer.
# Adjust our plot window size according to the expected output
options(repr.plot.width=30, repr.plot.height=30)
# Show our curated tree in a rectangular format
# We'll need some extra colour sets to accomplish this plot
combo.colours = c(brewer.pal(12, "Paired"), brewer.pal(8, "Set1"), brewer.pal(12, "Set3"))
# Save our first plot with just the tree
tree.plot <-
SC2_variants_time.tree %>%
drop.tip(., curated_drop_list) %>%
# 1. Data
ggtree(., layout = "rectangular", mrsd = "2021-02-17") + # Need the most recent sampling date
# 2. Aesthetics
# Themes
theme(text = element_text(size = 30),
legend.position = "bottom") +
labs(title = "Canada and US sequenced strain phylogeny",
colour = "Province") +
# We'll want to ensure our colour guide ends up in the correct order
scale_colour_viridis_d(option = "plasma",
guide = guide_legend(title="Province", order=1)) +
# 4. Geoms
geom_tippoint(aes(colour = strain_location), size = 5) + # Add tips only
geom_tiplab(size = 10, align=TRUE) + # Add in our labels
# Add in some annotation between points Canada/NL-NML-17194/2021 and Canada/ON-UHTC-0366/2020
geom_taxalink(taxa1="Canada/NL-NML-17194/2021", taxa2="Canada/ON-UHTC-0366/2020",
color="blue", alpha=0.8, offset=0,
outward=FALSE,
arrow=arrow(length=unit(0.01, "npc")))
# Generate our first heatmap layer
tree.plot <- gheatmap(tree.plot, curated_tree_info.df %>% select(2),
offset = 0.7, width = 0.1,
colnames_angle = 0, font.size = 10) +
# Set the name of our legend
scale_fill_discrete(name = "Nextstrain\nClade",
guide = guide_legend(order=2))
tree.plot <- tree.plot + new_scale_fill() # use this
tree.plot <-
gheatmap(tree.plot, curated_tree_info.df %>% select(3),
offset = 0.85, width = 0.1,
colnames_angle = 0, font.size = 10) +
scale_fill_manual(values = combo.colours, name = "PANGO\nlineage",
guide=guide_legend(order=3))
tree.plot
Warning message: "Duplicated aesthetics after name standardisation: size" Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale. Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.
ggtree package¶So we've spent quite a bit of time looking at phylogenetic trees and how to add extrenal data to tips and nodes. We've barely scratched the surface and there are a lot of additional functions within the ggtree package that you can work with to build amazing plots including adding fasta sequence data through msaplot(). You can experiment with labeling internal nodes, and specific clades as well. You can also facet_plot() rectangular trees with other kinds of plots!
We didn't even have time to combine all sorts of plot data with our circular trees using the ggtreeExtra package. Definitely check out Chapter 10 of the tree book to be inspired by the potential things you could do with more complex datasets like this:
We've seen a lot of trees today but a closely related structure is the network diagram. They share similar concepts in that both have nodes and branches. However, nodes are called vertices and can represent almost anything, branches between nodes are edges and can span across multiple nodes, instead of in a bifurcating tree relationship. Relationships between vertices can be bidirectional, and depending on what you'd like to present, multiple parallel edges may exist.
Network graphs are great for showing the interconnections between entities within your data. This kind of visualization can also help us realize where subtle connections occur (or don't occur!). Depending on how you've weighted or chosen edges, you can convey how strong relationships between entities are as well.
To work with graph data and plot it, we'll be using a couple of companion packages: tidygraph and ggraph. More about that later!
Since we already have some metadata about COVID genomes, let's see if we can't convert some of it to a network diagram to better understand strain information? We'll begin by selecting some information from our Nextstrain metadata set.
str(SC2_metadata.df, give.attr = FALSE)
tibble [6,627 x 24] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ Strain : chr [1:6627] "Wuhan/WH01/2019" "USA/MA1/2020" "Canada/ON_ON-VIDO-01-2/2020" "Canada/ON_VIDO-01/2020" ... $ GISAID clade : chr [1:6627] "L" "L" "L" "L" ... $ Nextstrain clade : chr [1:6627] "19A" "19A" "19A" "19A" ... $ Age : num [1:6627] 44 21 NA NA 54 NA 47 47 23 55 ... $ Clade : chr [1:6627] "19A" "19A" "19A" "19A" ... $ Country : chr [1:6627] "Asia" "USA" "Canada" "Canada" ... $ Admin Division : chr [1:6627] "Asia" "Massachusetts" "Ontario" "Ontario" ... $ Admin Division of exposure: chr [1:6627] "Asia" "Asia" "Ontario" "Asia" ... $ gisaid_epi_isl : chr [1:6627] "EPI_ISL_406798" "EPI_ISL_409067" "EPI_ISL_425177" "EPI_ISL_413015" ... $ Host : chr [1:6627] "Human" "Human" "Human" "Human" ... $ Originating Lab : chr [1:6627] "General Hospital of Central Theater Command of People's Liberation Army of China" "Massachusetts Department of Public Health" "Public Health Ontario" "Public Health Ontario Laboratory" ... $ PANGO lineage : chr [1:6627] "B" "B" "B" "B" ... $ Submission Date : chr [1:6627] "Older" "Older" "Older" "Older" ... $ Region : chr [1:6627] "Asia" "North America" "North America" "North America" ... $ Region of exposure : chr [1:6627] "Asia" "Asia" "North America" "Asia" ... $ Sex : chr [1:6627] "Male" "Male" "Male" "Male" ... $ strain : chr [1:6627] "Wuhan/WH01/2019" "USA/MA1/2020" "Canada/ON_ON-VIDO-01-2/2020" "Canada/ON_VIDO-01/2020" ... $ Emerging Nextstrain clade : chr [1:6627] "19A" "19A" "19A" "19A" ... $ Submitting Lab : chr [1:6627] "BGI & Institute of Microbiology, Chinese Academy of Sciences & Shandong First Medical University & Shandong Aca"| __truncated__ "Pathogen Discovery, Respiratory Viruses Branch, Division of Viral Diseases, Centers for Disease Control and Prevention" "Public Health Agency of Canada - National Microbiology Laboratory" "National Microbiology Laboratory" ... $ url : logi [1:6627] NA NA NA NA NA NA ... $ Collection Data : Date[1:6627], format: "2019-12-26" "2020-01-29" ... $ Author : chr [1:6627] "Weijun Chen et al (https://dx.doi.org/10.1016/S0140-6736(20)30251-8)" "Clinton R. Paden et al (https://dx.doi.org/10.1101/2020.03.09.20032896)" "Amrit S. Boese et al" "Shari Tyson et al" ... $ Country of exposure : chr [1:6627] NA "Asia" NA "Asia" ... $ Location : chr [1:6627] NA "Boston" NA NA ...
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=20)
SC2_graph_info.df <-
# Pass along our metadata
SC2_metadata.df %>%
# We'll grab information about each case that relates to its geographical region
select(Strain, Country, 'Admin Division', 'Admin Division of exposure', 'Region of exposure',
'Nextstrain clade', 'GISAID clade', 'PANGO lineage', 'Collection Data') %>%
# Do a bunch of renaming for our variables
rename(case_source = 'Admin Division of exposure',
case_location = 'Admin Division',
source_region = 'Region of exposure',
collection_date = 'Collection Data',
Nextstrain = 'Nextstrain clade',
GISAID = 'GISAID clade',
PANGO = 'PANGO lineage'
)
head(SC2_graph_info.df)
| Strain | Country | case_location | case_source | source_region | Nextstrain | GISAID | PANGO | collection_date |
|---|---|---|---|---|---|---|---|---|
| <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <date> |
| Wuhan/WH01/2019 | Asia | Asia | Asia | Asia | 19A | L | B | 2019-12-26 |
| USA/MA1/2020 | USA | Massachusetts | Asia | Asia | 19A | L | B | 2020-01-29 |
| Canada/ON_ON-VIDO-01-2/2020 | Canada | Ontario | Ontario | North America | 19A | L | B | 2020-01-23 |
| Canada/ON_VIDO-01/2020 | Canada | Ontario | Asia | Asia | 19A | L | B | 2020-01-23 |
| USA/CA-CDC-5/2020 | USA | California | Asia | Asia | 19A | L | B | 2020-01-29 |
| Taiwan/CGMH-CGU-02/2020 | Asia | Asia | Asia | Asia | 19A | S | B | 2020-02-04 |
tidygraph package to help prepare data¶Now that we have some information that we want to work with, we want to convert that kind of data to something that can be interpreted into a graph. The tidygraph package provides a way to hook graph data into the tidyverse so that we can use common verbs and ideas to filter and work with it. Using this package we can convert our dataframe information into a tbl_graph object which is actually an igraph object.
The function we'll use to convert our data is as_tbl_graph() but it has some expectations about the data. The parameters of this function are:
x: the data frame we'd like to convert to an igraph.nodes: a data frame with our node information.edges: a data frame of two columns containing integers matching node information to describe the relationship between nodes.If you have both a nodes and edges data frame you can certainly generate a table this way. However, we have a complex dataframe and we want to track all of the information in it. To that end we will add columns from and to based on variables that already exist and the graph nodes will be generated from this information. When we provide the data frame, the function will recognize the columns present and produce node-based data and generate edge characteristics from our other columns.
# Now we'll add some specific variables used to generate our graph table
SC2_graph_info.df %>%
mutate(from = case_source,
to = Country
) %>%
# Filter for just strain data from Canada and Asia
filter(Country %in% c(...)) %>%
# Convert to an igraph
as_tbl_graph() %>%
# Look at the structure of the igraph
str()
ggraph()¶The ggraph() function is one of many from a package of the same name. This package adds geoms and architecture that is compatible with ggplot2 (for the most part). Some of the functions we are interested in are:
| Component | Function | Description |
|---|---|---|
| Graph | ggraph | The equivalent of the call to ggplot, it sets up the basic information about the graph plot |
| Edge | geom_edge_link | Produces edges between points |
| Edge | geom_edge_fan | Produces edges between points but accounts for parallel edges, creating arcs that fan out |
| Edge | geom_edge_parallel | Produces multiple edges between points with parallel lines representing parallel edges |
| Edge | geom_edge_loop | Used to represent edges that start and end at the same node. Does not account for parallel edges |
| Node | geom_node_point | Add basic nodes to your graph |
| Node | geom_node_circle | Add nodes to your graph that can be scaled by the coordinate system |
The ggraph() function takes in 3 parameters:
graph the igraph object that we'll base the graph on.layout the type of layout for the graph. There are many options including: auto, dendrogram, linear, matrix, treemap (yep!) circlepack, partition, and hive.circular a logical stating whether the layout should be transformed into a radial representation. It can't be applied to all layouts (think polar_coord).# Now we'll add some specific variables used to generate our graph table
SC2_graph_info.df %>%
# Generate our from and to edge information
mutate(from = source_region,
to = Country
) %>%
# Convert to an igraph
as_tbl_graph() %>%
# 1. Data
ggraph(.) +
# 2. Aesthetics
theme(text = element_text(size = 20)) +
# Add our edges
...(aes(colour = Nextstrain), arrow=arrow(), width = 1) +
# Add our nodes and label them
...(size = 7) +
...(aes(label = name), size = 7, repel = TRUE)
geom_edge_fan() to visualize parallel edges¶So from our graph we can see some characteristics about how our case data is connected. Although most of the data is centred around genomes sequenced in North America so we can see that "North America" is a source region likely due the fact that we should use something like case_source instead for our from variable. We can change that if we want.
The way we see the edges, however also has all of them overlaying on top of each other. We know that this isn't going to be a 1:1 relationship and although we have coloured the edges, we only see the topmost edge in a stack of many. For a graph like this, if we want to see all of the edges, we can use the geom_edge_fan() command which will help us see all of our parallel edges in all of their splendour. Let's try it out.
# Now we'll add some specific variables used to generate our graph table
SC2_graph_info.df %>%
# Generate our from and to edge information
mutate(from = source_region,
to = Country
) %>%
# Convert to an igraph
as_tbl_graph() %>%
# 1. Data
ggraph(.) +
# 2. Aesthetics
theme(text = element_text(size = 20)) +
# Add our edges
geom_edge_fan(aes(colour = Nextstrain), arrow=arrow()) +
# Add our nodes and label them
geom_node_point(size = 7) +
geom_node_text(aes(label = name), size = 7, repel = TRUE)
The metadata we've chosen isn't very consistent and that can be the case for many datasets so you'll want to be sure of how you pick your nodes. Right now we are picking our from value based on the supposed source of the case. This can vary in consistency from a region (Asia) to a province (Ontario). Source regions also very wildly in our data set so we are kind of mixing many kinds of places in our graph. Definitely aim to be consisten when you're working with your data otherwise the nodes may have less meaning.
Let's try looking at case_source to case_location as it relates to data from Canada and Asia
# Now we'll add some specific variables used to generate our graph table
SC2_graph_info.df %>%
# Generate our from and to edge information
mutate(from = case_source,
to = ...
) %>%
# Filter for Canada data
filter(Country %in% c("Canada", "Asia")) %>%
# Convert to an igraph
as_tbl_graph() %>%
# 1. Data
ggraph(.) +
# 2. Aesthetics
theme(text = element_text(size = 20)) +
# Add our edges
geom_edge_fan(aes(colour = Nextstrain), arrow=arrow()) +
...() +
# Add our nodes and label them
geom_node_point(size = 7) +
geom_node_text(aes(label = name), size = 7, repel = TRUE)
ggraph¶We've really only covered linear graph layouts in our data but there are actually many kinds of graphs that this package can produce. The layout parameter is the key to exploring all of the graph types available. Of course your data layout all has to make sense! You'll find great examples from data-imaginist.com like this circlepack graph:
When working with data sometimes you may be working with sequence-specific data for analysis of motifs or you may have a large set of data describing genomic markers like single nucleotide variants. Two popular visualizations in this domain are the sequence logo and Manhattan plot.
ggseqlogo¶This one is short and simple. If you have a series of sequences you'd like to plot based on frequency of each bases you see at each position, the sequence logo is for you. The ggseqlogo package offers a simple ggplot2-compatible set of geoms you can add to your graph in order to represent this data.
Your data can come in two flavours:
Let's work with the former since it does have some flexibility and looks like something we recognize.
# Build an example sequence set (it's actually from the SARS-CoV-2 spike sequence )
example_seq1 <- c("aacccactaatggtgttggtt","aatccactaatggtgttgctt","aacccaataatagtgttggtt",
"aacccactaatggtgttggtt","aacccactaatggtattgatt","accccactaatgttgttggtt",
"aacccactaatggtgttggtt","aacccactaatggtattgatt","accccactaatgttgttggtt") %>%
# Convert it all to upper case because ggseqlogo appears to hate when we don't
str_to_upper()
example_seq2 <- c("cacccactaatggtgttggtg","aatccactaatggtgttgctt","aacccaataatagtgttggtt",
"cacccactaatggtgttggtg","aacccactaatggtattgatt","accccactaatgttgttggtt",
"cacccactaatggtgttggtg","aacccactaatggtattgatt","accccactaatgttgttggtt") %>%
# Convert it all to upper case because ggseqlogo appears to hate when we don't
str_to_upper()
# Build our list
example_seq.list = list(spike_set1 = ..., spike_set2 = ...)
# Take a look at our list
str(example_seq.list)
ggseqlogo() wrapper function¶We have two options to look at our sequence data. The ggseqlogo() is a wrapper that sets up the entire plot for us including if you'd like to facet the data. If you just want a single logo then you pass parts of the list individually.
If you'd like multiple logos, you can decided their layout with the ncol or nrow parameters. The other parameters are
facet which will accept either "wrap" or "grid"scales which will accept the scape types for the facet geom# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=4)
ggseqlogo(..., ...) +
theme(text = element_text(size = 20)) +
labs(title = "Facet in rows")
ggseqlogo(..., ...) +
theme(text = element_text(size = 20)) +
labs(title = "Facet in columns")
geom_logo()¶To plot our data we can use ggplot language that we are familiar with and add a layer to our plot with the geom_logo() function. In order for this to function properly, however, we need to define our parameters as:
data: the vector with of our sequence dataseq_type: sets the type of sequence like "DNA", "RNA" or "AA"method: the y-scale of our logo either in "bits" (information value) or by "probability"These and other parameters can also be set in the ggseqlogo() function too.
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=4)
# 1. Data
ggplot() +
# 2. Aesthetics
theme_logo() +
theme(text = element_text(size = 20)) +
labs(title = "Facet in columns") +
# 4. Geoms
geom_logo(data = example_seq.list, seq_type="DNA", method = "bits") +
# 6. Facet
facet_wrap(...)
Named for it's visual similarity to the Manhattan skyline of singular towering skyscrapers scattered above low-level buildings, this plot is commonly used for visualizing genomic markers across an ordered linear axis like a series of chromosomes. The y-axis of the values can represent LOD scores or proportions of marker representation. This is frequently used to visualize GWAS or mapping data.
This is another scatterplot that we can generate directly in ggplot with some effort/setup or we can use a package like qqman.
First let's look at the kind of data. Fun fact! You can read in archived/zipped data as well, although be careful, GWAS files can be quite large!
load(...)
str(..., give.attr = FALSE)
manhattan() function to generate a plot¶To generate our Manhattan plot we can use the manhattan() wrapper function which will require four sets of parameter information:
chr: the column name with the chromosome information for a given SNP.bp: the basepair position of the SNP.snp: the ID of the SNP usually a RefSNP ID (RSID) of some kind.p: the p-value for that particular SNP generated from the case vs. control analysis. This will be converted to a $-log10(p)$ value for the y-axis.# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=10)
GWAS_data.df %>%
...(., chr="CHR", bp="POS", snp="rsid", p="p.value")
highlight parameter to identify markers of interest¶Now that we've plotted our Manhattan plot, we can see some horizontal lines corresponding to minimum p-values of 1.0e-5 and 5.0e-8 although the significance of these can depend on sample size and allele frequencies of your SNPs.
Either way, you can highlight SNP results based on a provided list. In this case, we can generate a list ourselves by filter p-values.
# Adjust our plot window size according to the expected output
options(repr.plot.width=20, repr.plot.height=10)
snp_candidates <-
# Pass along our GWAS data
GWAS_data.df %>%
# Filter it for low p-values
filter(-log10(p.value) >= 5) %>%
# Select SNP candidates
select(...) %>%
unique() %>%
unlist()
# snp_candidates
# Plot our Manhattan plot
manhattan(GWAS_data.df,
chr="CHR",
bp="POS",
snp="rsid",
p="p.value",
highlight = ...)
With this final lecture we've covered the spectrum of visualizations covering the most basic of scatter- and barplots, graduating to boxplots, violin plots, and their variants. We've covered high-throughput data visualizations including volcano plots, heatmaps, and principal component analysis. Furthermore we looked at how simple the projection of high-dimension data can be with t-SNE and UMAP.
We finished our course today covering phylogenetic trees, network graphs, and some sequence analysis visualizations. Overall you now have the tools to wrangle data that may appear in all sorts of formats along with a better understanding of when and how to prepare some of the most common data visualizations in your burgeoning science careers.
Congratulations and good luck on your data science journey!
This week's assignment will be found under the current lecture folder under the "assignment" subfolder. It will include a Jupyter notebook that you will use to produce the code and answers for this week's assignment. Please provide answers in markdown or code cells that immediately follow each question section.
| Assignment breakdown | ||
|---|---|---|
| Code | 50% | - Does it follow best practices? |
| - Does it make good use of available packages? | ||
| - Was data prepared properly | ||
| Answers and Output | 50% | - Is output based on the correct dataset? |
| - Are groupings appropriate | ||
| - Are correct titles/axes/legends correct? | ||
| - Is interpretation of the graphs correct? |
Since coding styles and solutions can differ, students are encouraged to use best practices. Assignments may be rewarded for well-coded or elegant solutions.
You can save and download the Jupyter notebook in its native format. Submit this file to the the appropriate assignment section by 12pm on April 29th, 2021.
The book of ggtree from the Yu Lab: https://yulab-smu.top/treedata-book/index.html
Chapter 10 of ggtree with amazing and complicated plots: https://yulab-smu.top/treedata-book/chapter10.html
More information on the ggraph package: https://cran.r-project.org/web/packages/ggraph/ggraph.pdf
ggseqlogo package information: https://cran.r-project.org/web/packages/ggseqlogo/ggseqlogo.pdf
Manhattan plot tutorial: https://www.r-graph-gallery.com/101_Manhattan_plot.html
GWAS p-value threshold: https://www.nature.com/articles/s41380-020-0670-3?proof=t
More on GWAS thresholds for significance: https://www.nature.com/articles/ejhg2015269.pdf